Climate and Ground Water Trends in Columbus, Ohio¶
1. Introduction¶
Context & Importance
This report analyzes climate trends and groundwater levels in Columbus, OH, using data obtained from the NOAA and USGS websites. Columbus, OH has a humid continental climate (Dfa), characterized by four distinct seasons:
Summers are typically warm to hot and humid, with average high temperatures in July reaching the mid-80s °F (around 30 °C).
Winters are cold with average high temperatures in January around the upper 30s °F (about 3−4 °C) and lows in the low to mid-20s °F (around −5 °C). The city receives moderate snowfall.
Spring and Autumn are generally mild transitional seasons.
Precipitation is moderate and relatively consistent throughout the year, with a slight peak in the late spring and summer.
Research Questions (RQ)
RQ1. How have annual and seasonal average air temperatures changed over the study period (1946–2014)?
RQ2. Has the frequency of extreme temperature and precipitation events changed over this period?
RQ3. How do temperature anomalies relate to precipitation patterns in Columbus?
RQ4. Is there a trend in groundwater levels over time?
RQ5. Are there seasonal patterns in groundwater levels?
RQ6. Are extreme highs or lows becoming more common?
Data Description
Daily climate observations from 1946–2015 (the analysis below is restricted to complete years through 2014).
Variables: date, minimum/maximum temperature, temperature at time of observation, precipitation.
Daily groundwater level observations from 2005–2025.
Variables: date, groundwater level.
Data Preprocessing
Handling missing data (imputation or exclusion).
Aggregating daily data into monthly, seasonal, annual summaries.
Statistical Tools & Visualizations
Trend analysis: linear regression, Mann–Kendall test for monotonic trends.
Comparative analysis: decadal averages.
Visualization: line charts, bar charts, heatmaps.
Correlation/regression for temp–precip relationship.
# libraries
import pymannkendall as mk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
from scipy.stats import linregress, f_oneway, ks_2samp, pearsonr, spearmanr, kruskal
# import data; change DATE column to datetime
climate_df = pd.read_csv('climate_data_columbus_oh.csv')
climate_df['DATE'] = pd.to_datetime(climate_df['DATE'])
climate_df = climate_df[climate_df['DATE'].dt.year <= 2014]
print("The size of the data: ", climate_df.shape[0], "rows and", climate_df.shape[1], "columns.")
The size of the data: 24445 rows and 7 columns.
print(climate_df.isnull().sum())
STATION      0
NAME         0
DATE         0
PRCP       417
TMAX       360
TMIN       420
TOBS       353
dtype: int64
The numeric variables contain a small number of missing values.
print(climate_df.isnull().mean() * 100)
STATION    0.000000
NAME       0.000000
DATE       0.000000
PRCP       1.705870
TMAX       1.472694
TMIN       1.718143
TOBS       1.444058
dtype: float64
The output gives the percentage of missing values per column. A common rule of thumb: if a column is missing under ~5% of its values, it is usually safe to impute or drop the affected rows; much higher rates call for more caution. All numeric columns in our dataset are below 5%.
climate_df.dropna(inplace = True)
print("The size of the data after dropping missing values: ", climate_df.shape[0], "rows and", climate_df.shape[1], "columns.")
The size of the data after dropping missing values: 23751 rows and 7 columns.
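Dropping rows is defensible at these low missing rates. Had the gaps been larger, a common alternative is time-based interpolation, which fills each gap from neighboring days. A minimal sketch on a toy series (hypothetical values, not the pipeline data above):

```python
import numpy as np
import pandas as pd

# Toy daily max-temperature series with two gaps, indexed by date
idx = pd.date_range("1946-01-01", periods=6, freq="D")
tmax = pd.Series([27.0, np.nan, 37.0, 45.0, np.nan, 60.0], index=idx)

# Linear interpolation in time fills each gap from its neighbors
filled = tmax.interpolate(method="time")
print(filled.tolist())  # → [27.0, 32.0, 37.0, 45.0, 52.5, 60.0]
```

For evenly spaced daily data this is equivalent to linear interpolation; `method="time"` matters when observations are unevenly spaced.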
climate_df.head()
| | STATION | NAME | DATE | PRCP | TMAX | TMIN | TOBS |
|---|---|---|---|---|---|---|---|
| 0 | USC00331783 | COLUMBUS VLY CROSSING, OH US | 1946-01-01 | 0.02 | 27.0 | 15.0 | 23.0 |
| 1 | USC00331783 | COLUMBUS VLY CROSSING, OH US | 1946-01-02 | 0.00 | 30.0 | 10.0 | 28.0 |
| 2 | USC00331783 | COLUMBUS VLY CROSSING, OH US | 1946-01-03 | 0.00 | 37.0 | 25.0 | 37.0 |
| 3 | USC00331783 | COLUMBUS VLY CROSSING, OH US | 1946-01-04 | 0.00 | 45.0 | 36.0 | 44.0 |
| 4 | USC00331783 | COLUMBUS VLY CROSSING, OH US | 1946-01-05 | 0.00 | 60.0 | 36.0 | 59.0 |
A look at the first five rows confirms that all columns have the expected types. The data is ready for further analysis.
climate_df['year'] = climate_df['DATE'].dt.year
# Annual averages
annual_temps = climate_df.groupby('year')[['TMIN', 'TMAX']].mean()
# Calculate the mean of TMIN and TMAX
mean_tmin = annual_temps['TMIN'].mean()
mean_tmax = annual_temps['TMAX'].mean()
fig = go.Figure()
# Add min temp line
fig.add_trace(go.Scatter(
x=annual_temps.index,
y=annual_temps['TMIN'],
mode='lines',
name='Avg Min Temp',
line=dict(color='steelblue')
))
# Add max temp line
fig.add_trace(go.Scatter(
x=annual_temps.index,
y=annual_temps['TMAX'],
mode='lines',
name='Avg Max Temp',
line=dict(color='tomato')
))
# Add mean min temp line
fig.add_trace(go.Scatter(
x=annual_temps.index,
y=[mean_tmin]*len(annual_temps),
mode='lines',
name=f'Long Term Avg Min Temp ({mean_tmin:.2f}°F)',
line=dict(color='black', dash='dash')
))
# Add mean max temp line
fig.add_trace(go.Scatter(
x=annual_temps.index,
y=[mean_tmax]*len(annual_temps),
mode='lines',
name=f'Long Term Avg Max Temp ({mean_tmax:.2f}°F)',
line=dict(color='black', dash='dash')
))
# Layout
fig.update_layout(
title="Annual Average Min and Max Temperatures (Columbus, 1946–2014)",
xaxis_title="Year",
yaxis_title="Temperature (°F)",
template="simple_white",
legend=dict(bordercolor="gray", borderwidth=0.5),
width=900,
height=500
)
fig.show()
Both series oscillate closely around their respective long-term mean lines, indicating that while there are short-term fluctuations, the overall averages appear broadly stable; the formal trend tests later in the report quantify whether any long-term change is statistically significant.
Seasonal Trends (Winter vs. Summer)¶
climate_df['month'] = climate_df['DATE'].dt.month
# Map months to meteorological seasons: Jan-Feb Winter, Mar-May Spring,
# Jun-Aug Summer, Sep-Nov Fall, Dec Winter. The 'Winter' label appears twice,
# which requires ordered=False so pd.cut accepts the duplicate label.
climate_df['season'] = pd.cut(climate_df['month'], bins=[0, 2, 5, 8, 11, 12],
                              labels=['Winter', 'Spring', 'Summer', 'Fall', 'Winter'],
                              right=True, ordered=False)
seasonal_temps = climate_df.groupby(['year','season'])[['TMAX','TMIN']].mean().reset_index()
# --- Calculate the overall average temperature for each season/year combination ---
seasonal_temps['TAVG'] = (seasonal_temps['TMAX'] + seasonal_temps['TMIN']) / 2
# --- Calculate the Overall Mean TAVG for Winter and Summer ---
# Overall mean AVERAGE temperature for Winter across all years
mean_winter_tavg = seasonal_temps[seasonal_temps['season'] == 'Winter']['TAVG'].mean()
# Overall mean AVERAGE temperature for Summer across all years
mean_summer_tavg = seasonal_temps[seasonal_temps['season'] == 'Summer']['TAVG'].mean()
# Dictionary to hold the means for easy lookup in the loop
mean_temps_avg = {
'Winter': mean_winter_tavg,
'Summer': mean_summer_tavg
}
# Define colors for seasons
colors = {'Winter': 'steelblue', 'Summer': 'tomato'}
fig = go.Figure()
for season in ['Winter', 'Summer']:
subset = seasonal_temps[seasonal_temps['season'] == season]
color = colors[season]
mean_val = mean_temps_avg[season]
# Plot seasonal average temperature
fig.add_trace(go.Scatter(
x=subset['year'],
y=subset['TAVG'],
mode='lines',
name=f'{season} Avg Temp',
line=dict(color=color, width=2)
))
# Plot overall mean line
fig.add_trace(go.Scatter(
x=subset['year'],
y=[mean_val]*len(subset),
mode='lines',
name=f'Overall Mean {season} TAVG ({mean_val:.2f}°F)',
line=dict(color=color, dash='dash'),
))
# Layout
fig.update_layout(
title='Annual Average Winter and Summer Temperatures (Columbus, 1946–2014)',
xaxis_title='Year',
yaxis_title='Temperature (°F)',
template='simple_white',
width=900,
height=500,
legend=dict(bordercolor="gray", borderwidth=0.5)
)
fig.show()
Winter temperatures exhibit much greater inter-annual variability (larger swings above and below the mean) than summer temperatures. This is typical, as winter weather is more heavily influenced by large-scale, rapidly changing atmospheric patterns (cold fronts, arctic air masses).
Summer temperatures are highly stable, remaining close to the 72°F mean for most of the period, reflecting the stabilizing effect of consistent solar radiation and local climate factors during the warmest months.
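The variability claim can be quantified with the inter-annual standard deviation of each season's mean temperature. A minimal sketch on synthetic data (the real `seasonal_temps` frame built above would drop in directly; the toy numbers here are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
years = np.arange(1946, 2015)

# Toy seasonal mean temperatures: winters swing more year-to-year than summers
toy = pd.DataFrame({
    "year": np.tile(years, 2),
    "season": ["Winter"] * len(years) + ["Summer"] * len(years),
    "TAVG": np.concatenate([
        30 + rng.normal(0, 3.0, len(years)),   # volatile winters
        72 + rng.normal(0, 1.2, len(years)),   # stable summers
    ]),
})

# Inter-annual standard deviation of the seasonal mean, per season
spread = toy.groupby("season")["TAVG"].std()
print(spread)
```

A larger standard deviation for Winter than Summer would confirm the visual impression of wider winter swings.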
climate_df['decade'] = (climate_df['year']//10)*10
extremes = climate_df.groupby('decade').agg(
hot_days = ('TMAX', lambda x: (x>90).sum()),
cold_days = ('TMIN', lambda x: (x<0).sum())
)
# Create the bar chart
fig = go.Figure()
# Hot days
fig.add_trace(go.Bar(
x=extremes.index,
y=extremes['hot_days'],
name='Hot Days (>90°F)',
marker_color='tomato'
))
# Cold days
fig.add_trace(go.Bar(
x=extremes.index,
y=extremes['cold_days'],
name='Cold Days (<0°F)',
marker_color='steelblue'
))
# Layout
fig.update_layout(
barmode='group', # side-by-side bars
title='Extreme Temperature Days per Decade (Columbus, 1946–2014)',
xaxis_title='Decade',
yaxis_title='Number of Days',
template='simple_white',
width=900,
height=500,
legend=dict(bordercolor="gray", borderwidth=1)
)
fig.show()
The plot clearly illustrates an asymmetric change in climate extremes: the risk associated with extreme cold has drastically reduced, while the risk associated with extreme heat has generally increased and remained high throughout the late 20th and early 21st centuries.
Frequency of Heavy Rainfall Days (>1 inch/day)¶
rain_extremes = climate_df[climate_df['PRCP'] > 1].groupby('decade').size()
# Create the bar chart
fig = go.Figure()
fig.add_trace(go.Bar(
x=rain_extremes.index,
y=rain_extremes.values,
name='Heavy Rainfall Days (>1 inch/day)',
marker_color='CornflowerBlue'
))
# Layout
fig.update_layout(
title='Heavy Rainfall Days per Decade (>1 inch/day)',
xaxis_title='Decade',
yaxis_title='Number of Days',
template='simple_white',
width=900,
height=500
)
fig.show()
The plot shows an increasing trend in the frequency of heavy rainfall events over the period, peaking around the turn of the century. The apparent sharp drop in the 2010s should be read cautiously: that decade is incomplete, as the data run only through 2014.
# Filter for summer (already done)
summer_daily = climate_df[climate_df['season'] == "Summer"].copy()
# Add small offset to avoid log(0)
PRCP_OFFSET = 0.01
summer_daily['PRCP_offset'] = summer_daily['PRCP'] + PRCP_OFFSET
# Create scatter plot with color gradient based on precipitation
fig = px.scatter(
summer_daily,
x='TMAX',
y='PRCP_offset',
color='PRCP', # original precipitation for color intensity
color_continuous_scale='Teal', # blue-green gradient
opacity=0.6,
hover_data={'TMAX': True, 'PRCP': True, 'PRCP_offset': False},
labels={'TMAX': 'Daily Max Temp (°F)', 'PRCP_offset': 'Daily Precipitation (inches + 0.01)'},
title='Daily Max Temperature vs Precipitation in Summer (Columbus, 1946–2014)'
)
# Add simple rolling mean trend line
sorted_data = summer_daily.sort_values('TMAX')
tmax = sorted_data['TMAX']
trend = sorted_data['PRCP_offset'].rolling(30, min_periods=1, center=True).mean()
fig.add_traces(go.Scatter(
x=tmax,
y=trend,
mode='lines',
line=dict(color='tomato', width=2),
showlegend=False
))
# Log y-axis
fig.update_yaxes(type='log', title_text='Daily Precipitation (inches + 0.01)')
fig.update_layout(
width=900,
height=600,
template='simple_white',
coloraxis_colorbar=dict(title="Precipitation (inches)")
)
fig.show()
The plot illustrates the relationship between Daily Maximum Temperature (TMAX) and Daily Precipitation (PRCP) during summer months in Columbus (1946–2014). Each point represents a day, and precipitation values are shown on a logarithmic scale (with a 0.01-inch offset added to handle zero values). This transformation allows both frequent light rainfall events and rare heavy rainfall extremes to be visualized on the same axis.
The red line represents a rolling-mean smoothing curve, highlighting the central trend between temperature and precipitation.
Key Patterns Observed:
Nonlinear Relationship
The overall pattern is clearly nonlinear. The smoothed trend line and the scatter distribution show that precipitation does not increase or decrease in a simple, straight-line fashion with temperature.
Cooler Summer Days (below ~60°F):
The trend line is relatively flat and elevated near 0.01 inches (the offset floor), suggesting a higher average probability of rainfall compared to hotter days. While many dry days exist, light precipitation events are more common in this cooler range.
Moderate to Warm Temperatures (60°F–85°F):
The trend line declines as temperatures rise, indicating that typical summer highs correspond with fewer average rainfall days. The log scaling makes it clear that while most days are dry, occasional moderate rainfall occurs.
High Temperatures (85°F–105°F):
The trend line approaches the offset baseline, meaning nearly all of these hottest summer days are dry. The scattered nonzero points in this range represent rare rain events, often extreme, but they are statistical outliers.
Precipitation Extremes:
The largest rainfall events (exceeding 1 inch/day) cluster in the mid-temperature range (65–90°F), consistent with conditions favorable for intense summer thunderstorms—warm enough to provide convective energy, but not so hot as to signal drought conditions.
In summary, the log-scaled plot shows that cooler summer days have a higher likelihood of light to moderate rainfall, mid-range temperatures host the most intense storm events, and the hottest summer days are overwhelmingly dry.
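One way to make the "hotter days are drier" pattern concrete is the wet-day fraction per temperature bin. A sketch, assuming a frame like `summer_daily` with `TMAX` and `PRCP` columns (toy data generated here with an assumed hotter-is-drier rule, not the station record):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
tmax = rng.uniform(60, 100, n)
# Toy rule: the hotter the day, the less likely it rains
rain_prob = np.clip(0.5 - 0.01 * (tmax - 60), 0.05, None)
prcp = np.where(rng.random(n) < rain_prob, rng.exponential(0.3, n), 0.0)

toy = pd.DataFrame({"TMAX": tmax, "PRCP": prcp})
# Bin by temperature and compute the fraction of days with any rain
toy["tbin"] = pd.cut(toy["TMAX"], bins=[60, 70, 80, 90, 100])
wet_frac = toy.groupby("tbin", observed=True)["PRCP"].apply(lambda s: (s > 0).mean())
print(wet_frac)
```

On the real data, a declining `wet_frac` across bins would restate the trend-line reading above as a single, easily reported statistic.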
Correlation Heatmap (between average min, max temperatures and total precipitation)¶
annual = climate_df.groupby('year').agg(
avg_tmax=('TMAX','mean'),
avg_tmin=('TMIN','mean'),
total_precip=('PRCP','sum')
)
# Compute correlation matrix
corr_matrix = annual.corr()
# Heatmap
fig = px.imshow(
corr_matrix,
text_auto=True, # show correlation values on cells
color_continuous_scale='RdBu_r', # similar to 'coolwarm'
origin='upper',
labels=dict(x="Climate Variable", y="Climate Variable", color="Correlation")
)
fig.update_layout(
title='Correlation Between Annual Climate Variables',
width=700,
height=600
)
fig.show()
The key interpretation is that temperatures are strongly correlated with each other, but only weakly correlated with precipitation. In simple terms: knowing how hot or cold a year was (max or min temp) tells us very little about how much total precipitation that year received.
# Annual mean temperature
annual_mean = climate_df.groupby('year')[['TMAX','TMIN']].mean().reset_index()
# Test trend in max temps
slope, intercept, r_value, p_value, std_err = linregress(annual_mean['year'], annual_mean['TMAX'])
print(f"Max Temp Trend: slope={slope:.3f} °F/year, p={p_value:.4f}, R²={r_value**2:.3f}")
# Test trend in min temps
slope, intercept, r_value, p_value, std_err = linregress(annual_mean['year'], annual_mean['TMIN'])
print(f"Min Temp Trend: slope={slope:.3f} °F/year, p={p_value:.4f}, R²={r_value**2:.3f}")
Max Temp Trend: slope=-0.001 °F/year, p=0.9385, R²=0.000
Min Temp Trend: slope=-0.020 °F/year, p=0.0598, R²=0.052
Days Show No Trend (Max Temp): The trend is flat (slope ≈ 0). The high p-value (0.9385) confirms that any small change observed is likely due to random year-to-year variation, not a consistent long-term trend.
Nights May Be Cooling Slightly (Min Temp): The average minimum temperature declines by about 0.020°F per year, roughly 1.4°F over the 68-year span. The p-value (0.0598) falls just short of the conventional 0.05 threshold, so this trend is suggestive rather than conclusive.
Possible Widening Temperature Range: With days flat and nights (if anything) cooling slightly, the diurnal temperature range (the difference between the daily high and low) would be widening, though the marginal significance warrants caution.
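The "total change over the record" figure comes from multiplying the per-year slope by the span of years; the slope's standard error gives a rough uncertainty band on that total. A sketch with `linregress` on synthetic data (the drift and noise levels are assumptions for illustration):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)
years = np.arange(1946, 2015)
# Toy annual means with a known -0.02 °F/yr drift plus noise
temps = 45 - 0.02 * (years - years[0]) + rng.normal(0, 0.8, len(years))

res = linregress(years, temps)
span = years[-1] - years[0]            # 68 years
total = res.slope * span               # estimated change over the record
ci = 1.96 * res.stderr * span          # rough 95% band on the total
print(f"total change ≈ {total:.2f} ± {ci:.2f} °F over {span} years")
```

Reporting the band alongside the point estimate hedges conclusions drawn from marginal p-values like the 0.0598 above.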
Mann–Kendall Non-Parametric Test (trend robustness)¶
The Mann–Kendall Non-Parametric Test is a statistical test used to assess whether there is a monotonic trend (consistently increasing or decreasing pattern) in a time series dataset. It's often referred to as a "trend robustness" test because it makes very few assumptions about the underlying distribution of the data, making it robust to non-normal data and outliers.
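Under the hood, the test's S statistic simply counts concordant minus discordant pairs: S = Σ over i < j of sign(x_j − x_i). A minimal sketch of just that counting step (pymannkendall additionally computes the variance of S, Kendall's tau, and the p-value):

```python
import numpy as np

def mk_s(x):
    """Mann-Kendall S: sum of sign(x[j] - x[i]) over all pairs i < j."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return int(sum(np.sign(x[j] - x[i])
                   for i in range(n) for j in range(i + 1, n)))

print(mk_s([1, 2, 3, 4, 5]))   # → 10: all 10 pairs concordant (rising)
print(mk_s([5, 4, 3, 2, 1]))   # → -10: all pairs discordant (falling)
print(mk_s([2, 2, 2]))         # → 0: ties contribute nothing
```

Because S depends only on the signs of pairwise differences, the test is insensitive to outliers and to the data's distribution, which is why it is preferred as a robustness check alongside linear regression.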
mk_test_max = mk.original_test(annual_mean['TMAX'])
mk_test_min = mk.original_test(annual_mean['TMIN'])
print("Mann-Kendall Max Temp s-statistic and p-value:", mk_test_max.s, ",", round(mk_test_max.p, 2))
print("Mann-Kendall Min Temp s-statistic and p-value:", mk_test_min.s,",", round(mk_test_min.p, 2))
Mann-Kendall Max Temp s-statistic and p-value: 82.0 , 0.67
Mann-Kendall Min Temp s-statistic and p-value: -134.0 , 0.49
The Mann–Kendall trend test indicates no statistically significant long-term monotonic trend in maximum or minimum daily temperatures in Columbus, OH, from 1946 to 2014 (p > 0.05 for both). Although the minimum temperature series shows a weak negative S statistic, this trend is not significant.
# Hot days >90°F, cold days <0°F
hot_days = climate_df.groupby('year')['TMAX'].apply(lambda x: (x>90).sum())
cold_days = climate_df.groupby('year')['TMIN'].apply(lambda x: (x<0).sum())
# Regression test for hot days
slope, intercept, r_value, p_value, std_err = linregress(hot_days.index, hot_days.values)
print(f"Hot Days Trend: slope={slope:.2f} days/year, p={p_value:.4f}")
# Regression test for cold days
slope, intercept, r_value, p_value, std_err = linregress(cold_days.index, cold_days.values)
print(f"Cold Days Trend: slope={slope:.2f} days/year, p={p_value:.4f}")
Hot Days Trend: slope=0.08 days/year, p=0.1272
Cold Days Trend: slope=0.00 days/year, p=0.8740
Trend analysis of temperature extremes in Columbus, OH (1946–2014) showed no statistically significant change in the frequency of hot or cold days. Hot days exhibited a slight positive trend (+0.08 days/year, p = 0.1272), while cold days showed no change (+0.00 days/year, p = 0.8740). These results suggest that the observed variations are consistent with natural variability rather than a persistent long-term trend.
Change in Heavy Rainfall Days¶
heavy_rain = climate_df.groupby('year')['PRCP'].apply(lambda x: (x>1).sum())
slope, intercept, r_value, p_value, std_err = linregress(heavy_rain.index, heavy_rain.values)
print(f"Heavy Rainfall Days Trend: slope={slope:.2f} days/year, p={p_value:.4f}")
Heavy Rainfall Days Trend: slope=0.05 days/year, p=0.0018
The test indicates a statistically significant increase in heavy rainfall days in Columbus, OH, from 1946 to 2014 (slope = +0.05 days/year, p = 0.0018). This corresponds to roughly 3–4 additional heavy rainfall days over the 69-year study period, suggesting a shift toward more frequent extreme precipitation events.
Correlation Between Summer Temp & Precip¶
print("---------------------------------------------------------------------------------")
print("Pearson & Spearman Correlation Statistics for Summer Max Temp and Precipitation:")
pearson_r, pearson_p = pearsonr(summer_daily['TMAX'], summer_daily['PRCP'])
print(f"Pearson r={pearson_r:.3f}, p={pearson_p:.4f}")
spearman_r, spearman_p = spearmanr(summer_daily['TMAX'], summer_daily['PRCP'])
print(f"Spearman ρ={spearman_r:.3f}, p={spearman_p:.4f}")
print("---------------------------------------------------------------------------------")
print("Pearson & Spearman Correlation Statistics for Summer Min Temp and Precipitation:")
pearson_r, pearson_p = pearsonr(summer_daily['TMIN'], summer_daily['PRCP'])
print(f"Pearson r={pearson_r:.3f}, p={pearson_p:.4f}")
spearman_r, spearman_p = spearmanr(summer_daily['TMIN'], summer_daily['PRCP'])
print(f"Spearman ρ={spearman_r:.3f}, p={spearman_p:.4f}")
print("---------------------------------------------------------------------------------")
---------------------------------------------------------------------------------
Pearson & Spearman Correlation Statistics for Summer Max Temp and Precipitation:
Pearson r=-0.058, p=0.0000
Spearman ρ=-0.097, p=0.0000
---------------------------------------------------------------------------------
Pearson & Spearman Correlation Statistics for Summer Min Temp and Precipitation:
Pearson r=0.186, p=0.0000
Spearman ρ=0.277, p=0.0000
---------------------------------------------------------------------------------
Correlation analysis shows statistically significant but weak associations between summer temperatures and daily precipitation in Columbus, OH (1946–2014). For maximum temperatures, Pearson’s correlation (r = -0.058, p < 0.001) and Spearman’s rank correlation (ρ = -0.097, p < 0.001) both indicate a small negative relationship: hotter summer days are slightly more likely to be dry. In contrast, for minimum temperatures, Pearson’s correlation (r = 0.186, p < 0.001) and Spearman’s rank correlation (ρ = 0.277, p < 0.001) reveal a modest positive relationship: warmer nights are associated with higher daily precipitation.
Overall, these results suggest that while hot summer days often coincide with dryness, warmer nighttime conditions are more conducive to rainfall.
import dataretrieval.nwis as nwis
# USGS site code for ground water levels in Columbus OH.
site = '400540082540400'
# retrieve data
gwl_df = nwis.get_record(sites=site, service='dv', start='2005-10-06', end='2025-09-26', parameterCd='72019')
# reset index and set date to datetime
gwl_df = gwl_df.reset_index()
gwl_df['datetime'] = pd.to_datetime(gwl_df['datetime'])
# rename needed columns
rename_dict = {'datetime': 'Date', '72019_Maximum': 'Water_Level_ft'}
gwl_df = gwl_df.rename(columns=rename_dict)
# columns to retain
columns_to_keep = ['Date', 'Water_Level_ft']
gwl_df = gwl_df[columns_to_keep]
print("The count of missing values for date and ground water levels:")
gwl_df.isna().sum()
The count of missing values for date and ground water levels:
Date              0
Water_Level_ft    0
dtype: int64
print("The data types:")
gwl_df.dtypes
The data types:
Date              datetime64[ns, UTC]
Water_Level_ft                float64
dtype: object
print("The dataset has", gwl_df.shape[0], "rows and", gwl_df.shape[1], "columns.")
The dataset has 7268 rows and 2 columns.
print("First 5 rows of data:")
gwl_df.head()
First 5 rows of data:
| | Date | Water_Level_ft |
|---|---|---|
| 0 | 2005-10-06 00:00:00+00:00 | 21.61 |
| 1 | 2005-10-07 00:00:00+00:00 | 21.62 |
| 2 | 2005-10-08 00:00:00+00:00 | 21.59 |
| 3 | 2005-10-09 00:00:00+00:00 | 21.64 |
| 4 | 2005-10-10 00:00:00+00:00 | 21.68 |
result = mk.original_test(gwl_df['Water_Level_ft'])
# Scatter for original groundwater levels
fig = go.Figure()
fig.add_trace(go.Scatter(
x=gwl_df['Date'],
y=gwl_df['Water_Level_ft'],
mode='lines',
name='Groundwater Levels',
line=dict(color='steelblue', width=2),
#opacity=0.7
))
# Rolling mean (30-day)
rolling_mean = gwl_df['Water_Level_ft'].rolling(30).mean()
fig.add_trace(go.Scatter(
x=gwl_df['Date'],
y=rolling_mean,
mode='lines',
name='30-day Rolling Mean',
line=dict(color='tomato', width=2)
))
# Add annotation for Mann-Kendall test result
mk_text = f"Trend: {result.trend}<br>Tau={result.Tau:.2f}, p={result.p:.3f}"
fig.add_annotation(
xref='paper', yref='paper',
x=0.5, y=1,
text=mk_text,
showarrow=False,
bordercolor="black",
borderwidth=1,
bgcolor="white",
#opacity=0.7
)
# Layout
fig.update_layout(
title="Groundwater Levels Over Time",
xaxis_title="Year",
yaxis_title="Groundwater Level (ft)",
template='simple_white',
width=900,
height=500,
legend=dict(bordercolor="gray", borderwidth=0.5)
)
fig.show()
The groundwater levels exhibit a strong, predictable seasonal pattern superimposed on a statistically significant but slight long-term decreasing trend over the observed period (2005–2025). The oscillation band sits somewhat lower between roughly 2012 and 2020, consistent with generally lower groundwater levels during those years.
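One caveat: applying Mann–Kendall to raw daily values with a strong annual cycle can conflate seasonality with trend. A hedged sketch of one mitigation, smoothing with a 365-day centered rolling mean so only the long-term component remains for trend testing (toy series with an assumed cycle plus a small downward drift, not the USGS record):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2005-10-06", periods=7268, freq="D")
t = np.arange(len(dates))
# Toy level: annual cycle + slow downward drift + noise
level = (21.5 + 0.8 * np.sin(2 * np.pi * t / 365.25) - 0.0001 * t
         + np.random.default_rng(3).normal(0, 0.05, len(t)))
s = pd.Series(level, index=dates)

# A 365-day centered rolling mean averages out the annual cycle,
# leaving the long-term component for trend testing
trend = s.rolling(365, center=True).mean().dropna()
print(trend.iloc[0], trend.iloc[-1])
```

The smoothed series should reveal the drift cleanly; pymannkendall also ships seasonal variants of the test for the same purpose.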
RQ5. Are there seasonal patterns in groundwater levels?¶
Distributions of ground water levels and One-Way ANOVA Test
gwl_df['month'] = gwl_df['Date'].dt.month
groups = [gwl_df[gwl_df['month'] == m]['Water_Level_ft'] for m in gwl_df['month'].unique()]
anova_result = f_oneway(*groups)
# Boxplot of Water Level by Month
fig = px.box(
gwl_df,
x='month',
y='Water_Level_ft',
color='month',
#color_continuous_scale='RdBu', # similar to "coolwarm"
labels={'month': 'Month', 'Water_Level_ft': 'Groundwater Level (ft)'},
title='Seasonal Variation in Groundwater Levels'
)
# Remove legend for colors if desired
fig.update_layout(showlegend=False)
# Add ANOVA annotation
anova_text = f"ANOVA F={anova_result.statistic:.2f}, p={anova_result.pvalue:.3f}"
fig.add_annotation(
xref='paper', yref='paper',
x=0.02, y=1,
text=anova_text,
showarrow=False,
bordercolor="black",
borderwidth=1,
bgcolor="white",
#opacity=0.7,
font=dict(size=12)
)
fig.update_layout(
width=900,
height=500,
template='simple_white',
xaxis=dict(title='Month'),
yaxis=dict(title='Groundwater Level (ft)')
)
fig.show()
The plot clearly demonstrates a highly significant seasonal cycle in groundwater levels. Levels consistently peak in the late fall/early winter (October-December) and reach their minimum in the early spring (March-May). This pattern is typical in many regions where recharge is dominant in cooler, wetter seasons and drawdown increases during the growing season.
The ANOVA result indicates that the differences in groundwater levels across the 12 months of the year are highly statistically significant. The null hypothesis—that the mean groundwater level is the same for all months—is strongly rejected. This confirms that there is a definite and measurable seasonal pattern.
RQ6. Are extreme highs or lows becoming more common?¶
# Extract year
gwl_df['year'] = gwl_df['Date'].dt.year
# Select last 3 years
last_years = sorted(gwl_df['year'].unique())[-3:]
subset = gwl_df[gwl_df['year'].isin(last_years)]
# Prepare data for each year
data = [subset[subset['year'] == y]['Water_Level_ft'] for y in last_years]
# Create KDE plots
fig = ff.create_distplot(
data,
group_labels=[str(y) for y in last_years],
show_hist=False, # no histogram bars
show_rug=False,
colors=['#66c2a5', '#fc8d62', '#8da0cb'], # Set2 palette
curve_type='kde'
)
fig.update_traces(line=dict(width=3))
# Layout
fig.update_layout(
title="Distribution of Groundwater Levels by Year",
xaxis_title="Groundwater Level (ft)",
yaxis_title="Density",
width=900,
height=500,
template='simple_white'
)
fig.show()
The plot demonstrates a shift in the distribution of groundwater levels:
2024 was a year where levels were most frequently high (concentrated at 22.5 ft).
2023 had a more balanced frequency between high and mid-range levels.
2025 (for the data available) shows a strong central tendency with levels primarily clustered in the mid-range (≈21.2 ft) and a lack of the high-level events seen in the previous two years. This is consistent with a year that has generally lower groundwater levels compared to 2024.
# group data by year
groups = [g['Water_Level_ft'].values for _, g in gwl_df.groupby('year')]
# Kruskal-Wallis test
kw_result = kruskal(*groups)
print("Kruskal-Wallis H-stat:", round(kw_result.statistic,2), "p-value:", kw_result.pvalue)
Kruskal-Wallis H-stat: 2047.56 p-value: 0.0
This result aligns with the prior analysis (especially the KDE plot comparing 2023, 2024, and 2025) which showed a clear shift in the distribution of levels between years. The Kruskal-Wallis test confirms that these visual differences are statistically robust, meaning the average water conditions change significantly from one year to the next in terms of overall level.
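Kruskal–Wallis only says that at least one year differs, not which ones. A common follow-up is pairwise Mann–Whitney U tests with a Bonferroni correction for multiple comparisons; a sketch on three toy yearly samples (means and sizes loosely mimic the 2023–2025 comparison, but the values are assumptions):

```python
from itertools import combinations

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(4)
samples = {
    2023: rng.normal(21.8, 0.4, 300),
    2024: rng.normal(22.4, 0.3, 300),
    2025: rng.normal(21.2, 0.3, 270),
}

pairs = list(combinations(samples, 2))
alpha = 0.05 / len(pairs)          # Bonferroni-adjusted threshold
for a, b in pairs:
    stat, p = mannwhitneyu(samples[a], samples[b])
    print(a, "vs", b, "p =", f"{p:.2e}",
          "significant" if p < alpha else "n.s.")
```

Run against `gwl_df.groupby('year')`, this would identify exactly which year-to-year shifts in groundwater level are statistically robust.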